[SPARK-21472][SQL] Introduce ArrowColumnVector as a reader for Arrow vectors. #18680
ueshin wants to merge 8 commits into apache:master from
Conversation
| @Override | ||
| public boolean[] getBooleans(int rowId, int count) { | ||
| assert(dictionary == null); | ||
| NullableBitVector.Accessor accessor = boolData.getAccessor(); |
Can we use `nulls`? Ditto for other places.
I'm afraid not, because the type of `nulls` is `ValueVector.Accessor`, which has only simple methods such as `isNull()`.
The concrete accessor APIs are different for each type.
Or should we cast `nulls` to the concrete type each time?
I see. Can we keep a `NullableBitVector.Accessor` instead of a `NullableBitVector`, while keeping the same reference in two instance variables? I am worried about the cost of the runtime cast in the `getBoolean()` method rather than in the `getBooleans()` method.
This is why I expect the `get()` method to be inlined by the JIT compiler, since each `Accessor` class is final.
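To make the idea concrete, here is a minimal sketch (hypothetical class and field names, assuming the old Arrow Java API with `NullableBitVector` and `ValueVector.Accessor`) of holding the same accessor under two references, so per-row reads need no runtime cast:

```java
import org.apache.arrow.vector.NullableBitVector;
import org.apache.arrow.vector.ValueVector;

// Sketch: one accessor instance, two static types. The generic reference is
// enough for null checks; the concrete, final Accessor type lets the JIT
// inline get() at the hot getBoolean() call site.
final class BooleanColumnSketch {
  private final ValueVector.Accessor nulls;          // generic: isNull(rowId)
  private final NullableBitVector.Accessor accessor; // concrete: get(rowId)

  BooleanColumnSketch(NullableBitVector vector) {
    this.accessor = vector.getAccessor();
    this.nulls = this.accessor;  // same instance, no extra state
  }

  boolean isNullAt(int rowId) { return nulls.isNull(rowId); }
  boolean getBoolean(int rowId) { return accessor.get(rowId) == 1; }
}
```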
|
|
||
| @Override | ||
| public boolean getBoolean(int rowId) { | ||
| return boolData.getAccessor().get(rowId) == 1; |
Can we use `nulls`? Ditto for other places.
|
Test build #79752 has finished for PR 18680 at commit
|
| */ | ||
| public abstract class ReadOnlyColumnVector extends ColumnVector { | ||
|
|
||
| protected ReadOnlyColumnVector(int capacity, MemoryMode memMode) { |
Is there any reason not to accept `dataType` as one of the arguments? Having the argument would be more flexible for future usages.
I see, I'll modify it to accept `dataType`, but I guess we shouldn't pass it to `ColumnVector`, to avoid illegally allocating child columns.
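A minimal sketch of that approach (assumptions: `ColumnVector` has a `(capacity, type, memMode)` constructor and a protected `type` field, and `DataTypes.NullType` works as a placeholder that allocates no child columns):

```java
import org.apache.spark.memory.MemoryMode;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

// Sketch only (assumes the same package as ColumnVector): accept the real
// dataType, but hand the parent a placeholder type so it does not allocate
// child columns for complex types on our behalf.
public abstract class ReadOnlyColumnVector extends ColumnVector {
  protected ReadOnlyColumnVector(int capacity, DataType type, MemoryMode memMode) {
    super(capacity, DataTypes.NullType, memMode);  // placeholder type
    this.type = type;                              // keep the actual type here
  }
}
```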
|
Test build #79763 has finished for PR 18680 at commit
|
BryanCutler left a comment
Thanks @ueshin for this. I made a first pass; I see a lot of things are scoped as public - is this intended to be a public API?
| case _ => throw new UnsupportedOperationException(s"Unsupported data type: $dt") | ||
| } | ||
|
|
||
| def toArrowField(name: String, dt: DataType, nullable: Boolean): Field = { |
Is this only used for testing?
No, this is used to create an Arrow schema from `StructType` in `ArrowUtils.toArrowSchema()`, too.
|
|
||
| import org.apache.spark.sql.types._ | ||
|
|
||
| object ArrowUtils { |
Shouldn't this be `private[sql]`? Also in other places.
| /** | ||
| * A column backed by Apache Arrow. | ||
| */ | ||
| public final class ArrowColumnVector extends ReadOnlyColumnVector { |
Is this planned to be a public API right now?
| } | ||
| resultStruct = new ColumnarBatch.Row(childColumns); | ||
| } else { | ||
| throw new UnsupportedOperationException(); |
Can this whole "if else" block be put into a pattern match instead?
Unfortunately, this class is written in Java, so we can't use a pattern match.
| /** | ||
| * An abstract class for read-only column vector. | ||
| */ | ||
| public abstract class ReadOnlyColumnVector extends ColumnVector { |
Wouldn't it be better to refactor ColumnVector into classes that separate reading/writing so you could just extend the read portion instead of making this class that throws exceptions on writes? e.g.
ColumnVector -> ColumnVectorWritable -> ColumnVectorReadable
ArrowColumnVector -> ColumnVectorReadable
I agree that it'd be better to refactor `ColumnVector`, but `ColumnVector` is tied to `ColumnarBatch` and other classes, so we should do that refactoring together with `ColumnarBatch` in future PRs.
+1 on separating the read/write; we should definitely do this before we publish the `ColumnVector` interfaces.
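A minimal sketch of the suggested split (hypothetical class names, not Spark's actual API): getters live in a read-only base class and setters are added by a writable subclass, so a read-only implementation never has to throw on writes.

```java
// Hypothetical hierarchy sketch: reads in the base class, writes in the subclass.
abstract class ColumnVectorReadable {
  protected final int capacity;
  protected ColumnVectorReadable(int capacity) { this.capacity = capacity; }
  abstract boolean isNullAt(int rowId);
  abstract boolean getBoolean(int rowId);
  // ...more typed getters in a real interface
}

abstract class ColumnVectorWritable extends ColumnVectorReadable {
  protected ColumnVectorWritable(int capacity) { super(capacity); }
  abstract void putBoolean(int rowId, boolean value);
  // ...more typed setters
}

// ArrowColumnVector would then extend only ColumnVectorReadable.
```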
|
@BryanCutler Thank you for reviewing! |
|
Test build #79787 has finished for PR 18680 at commit
|
|
@BryanCutler all classes under the |
| import org.apache.spark.unsafe.types.UTF8String; | ||
|
|
||
| /** | ||
| * A column backed by Apache Arrow. |
| public boolean[] getBooleans(int rowId, int count) { | ||
| boolean[] array = new boolean[count]; | ||
| for (int i = 0; i < count; ++i) { | ||
| array[i] = accessor.getBoolean(rowId + i); |
We don't need to address this now, but do we have a better implementation with Arrow? cc @BryanCutler
Kind of a batch read API.
I checked Arrow's API docs. I didn't find a batch read API.
| childColumns = new ColumnVector[1]; | ||
| childColumns[0] = new ArrowColumnVector(listVector.getDataVector()); | ||
| resultArray = new Array(childColumns[0]); | ||
| } else if (vector instanceof MapVector) { |
An unrelated question: why is a vector for struct type called `MapVector` in Arrow? cc @BryanCutler
I'm not sure about the design decision behind it, but it's meant to look up child vectors by name, so it uses a kind of hash map. I agree that another name would have been more intuitive.
|
|
||
| @Override | ||
| final int getArrayLength(int rowId) { | ||
| return accessor.get(rowId + 1) - accessor.get(rowId); |
If the given rowId is the last row, is it still valid to call get(rowId + 1)?
Yes, the offset vector for `ListVector` should have (number of arrays + 1) values.
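A small worked example (illustrative values only) of why `get(rowId + 1)` stays in bounds even for the last row:

```java
public class OffsetsExample {
  public static void main(String[] args) {
    // Arrow-style list offsets: n lists are described by n + 1 offsets, so
    // offsets[rowId + 1] exists for every valid rowId, including the last.
    // lists: [[a, b], [], [c, d, e]]
    int[] offsets = {0, 2, 2, 5};
    int rowId = 2;                                     // the last list
    int length = offsets[rowId + 1] - offsets[rowId];  // 5 - 2 = 3
    System.out.println(length);                        // prints 3
  }
}
```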
|
LGTM, pending jenkins |
|
Test build #79793 has finished for PR 18680 at commit
|
|
thanks, merging to master! |
|
Have you guys checked the performance of this change? It changes the number of concrete implementations for column vector from 2 to 3 (and potentially 1 to 2 at runtime). This might (or might not) have huge performance implications, because it might disable inlining or force virtual dispatches. (It depends on how we call the column vector.) |
… for Arrow vectors.

## What changes were proposed in this pull request?

This is a follow-up of #18680. In some environments, a compile error happens, saying:

```
.../sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java:243: error: not found: type Array
public void loadBytes(Array array) {
                      ^
```

This PR fixes it.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #18701 from ueshin/issues/SPARK-21472_fup1.
…ctor type.

## What changes were proposed in this pull request?

As mentioned at apache#18680 (comment), when we have more `ColumnVector` implementations, it might (or might not) have huge performance implications, because it might disable inlining or force virtual dispatches.

As for the read path, one of the major paths is the one generated by `ColumnBatchScan`. Currently it refers to `ColumnVector`, so the penalty will be bigger as we have more classes, but we can know the concrete type from its usage, e.g. the vectorized Parquet reader uses `OnHeapColumnVector`. We can use the concrete type in the generated code directly to avoid the penalty.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes apache#18989 from ueshin/issues/SPARK-21781.
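A minimal sketch of the SPARK-21781 idea (assuming Spark's `ColumnarBatch`, `ColumnVector`, and `OnHeapColumnVector` classes of that era; the method shapes here are illustrative, not the actual generated code): casting once to the concrete class known at codegen time keeps the per-row call site monomorphic, which is much easier for the JIT to inline.

```java
import org.apache.spark.sql.execution.vectorized.ColumnVector;
import org.apache.spark.sql.execution.vectorized.ColumnarBatch;
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;

class ConcreteTypeSketch {
  // Reading through the abstract type: once several ColumnVector
  // implementations are live, this call site can go polymorphic,
  // hindering inlining.
  static int readViaAbstractType(ColumnarBatch batch, int rowId) {
    ColumnVector v = batch.column(0);
    return v.getInt(rowId);
  }

  // Casting once to the concrete class known from the usage (e.g. the
  // vectorized Parquet reader uses OnHeapColumnVector) keeps the hot
  // call monomorphic.
  static int readViaConcreteType(ColumnarBatch batch, int rowId) {
    OnHeapColumnVector v = (OnHeapColumnVector) batch.column(0);
    return v.getInt(rowId);
  }
}
```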
What changes were proposed in this pull request?
Introducing `ArrowColumnVector` as a reader for Arrow vectors. It extends `ColumnVector`, so we will be able to use it with `ColumnarBatch` and its functionalities. Currently it supports primitive types and `StringType`, `ArrayType` and `StructType`.

How was this patch tested?

Added tests for `ArrowColumnVector` and existing tests.